QRS detection and classification in Holter ECG data in one inference step

While various QRS detection and classification methods were developed in the past, Holter ECG data acquired during daily activities by wearable devices present new challenges, such as increased noise and artefacts due to patient movements. Here, we present a deep-learning model to detect and classify QRS complexes in single-lead Holter ECG. We introduce a novel approach, delivering QRS detection and classification in one inference step. We used a private dataset (12,111 Holter ECG recordings, length of 30 s) to train, validate, and test the method. Twelve public databases were used to further test method performance. We built a software tool to rapidly annotate QRS complexes in the private dataset, and we annotated 619,681 QRS complexes. The standardised and down-sampled ECG signal forms a 30-s long input for the deep-learning model. The model consists of five ResNet blocks and a gated recurrent unit layer. The model's output is a 30-s long 4-channel probability vector (no-QRS, normal QRS, premature ventricular contraction, premature atrial contraction). Output probabilities are post-processed to obtain predicted QRS annotation marks. For the QRS detection task, the proposed method achieved an F1 score of 0.99 on the private test set. The overall mean cross-database F1 score across twelve external public databases was 0.96 ± 0.06. In terms of QRS classification, the presented method showed micro and macro F1 scores of 0.96 and 0.74 on the private test set, respectively. Cross-database results using four external public datasets showed micro and macro F1 scores of 0.95 ± 0.03 and 0.73 ± 0.06, respectively. The presented results showed that QRS detection and classification could be reliably computed in one inference step. The cross-database tests showed higher overall QRS detection performance than any of the compared methods.


Data
We used private (Fig. 3A) and public (Fig. 3B) ECG datasets in this study. The anonymised, private ECG dataset was collected during routine ECG screening and, therefore, was not subject to ethics committee review under Czech law. This private dataset was used for method development (Fig. 3C) and testing, and public datasets were used only for cross-database tests (Fig. 3D). Lead "I" was used if the dataset contained multiple leads. If it was not present, the first ECG lead was used.
The private recordings were acquired from patients during usual daily activities and often contained a high amount of noise (Fig. 1B,D). We developed a software tool, "QRS Marker". Two specialists with more than five years of experience with QRS detection and classification semi-automatically marked 619,681 QRS complexes in this tool. Next, the data were split into training (80%), validation (10%), and testing (10%) datasets (Table 1) in an out-of-patient manner, i.e., no patient appears in more than one subset.
Public datasets. We also used twelve public databases (1,602,960 QRS complexes from 3,050 recordings, sampling frequency from 128 to 1000 Hz) to test QRS detection performance. These twelve databases (Table 2) were used for cross-database tests in the QRS detection task. Databases EDB, INCART, MIT-BIH, and SVDB contained QRS classes and were used for cross-database tests of QRS classification performance (Table 2, rows highlighted with *). The proposed method is designed to classify beats as normal, PAC, or PVC; therefore, if the QRS complexes in a database were classified in more detail (e.g., a paced beat), the closest possible option was selected (e.g., a normal beat).
Figure 3. Dataflow in the presented study: private MDT data (A) were split into training, validation, and test subsets. Training and validation subsets were used to develop the proposed method (C). Next, the method performance was measured (D) using MDT private test data and data from twelve public databases (B). For the QRS classification task, only four databases were used.
Training data augmentation. We randomly inverted each signal with a probability of 0.5 and cropped the signal to a random 30-s segment, modifying the data for each batch. We used weighted oversampling to balance the counts of the QRS types we trained on.
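The augmentation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the representation of a recording as a Python list, and the use of one label per record for oversampling are assumptions made for illustration only.

```python
import random


def augment(signal, beats, fs, rng, crop_seconds=30, p_invert=0.5):
    """Randomly crop one recording to crop_seconds and invert it with p_invert.

    signal: list of samples; beats: list of (sample_index, label) pairs.
    Beat positions are shifted so they stay aligned with the cropped signal.
    """
    n = int(crop_seconds * fs)
    start = rng.randrange(len(signal) - n + 1) if len(signal) > n else 0
    cropped = signal[start:start + n]
    kept = [(i - start, label) for i, label in beats if start <= i < start + n]
    if rng.random() < p_invert:
        cropped = [-x for x in cropped]
    return cropped, kept


def oversampling_weights(record_labels):
    """Weight each record by the inverse frequency of its label.

    Assigning a single label per record is an assumption of this sketch.
    """
    counts = {}
    for label in record_labels:
        counts[label] = counts.get(label, 0) + 1
    return [1.0 / counts[label] for label in record_labels]


def draw_batch(records, weights, batch_size, rng):
    """Sample records with replacement so that rare labels appear more often."""
    return rng.choices(records, weights=weights, k=batch_size)
```

With these weights, every label contributes the same total sampling mass, so a batch drawn by `draw_batch` contains rare QRS types about as often as common ones.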

Method
All experiments were performed in accordance with relevant guidelines and regulations. The method is designed to work as in Fig
Preprocessing. Before feeding the training signals into the model, we resampled each signal to 100 Hz and standardised it independently to have zero mean and unit variance (Fig. 5A). Target data (y) for the model were prepared as follows: each QRS location was widened to 10 samples to create a four-channel segmentation mask (as in Fig. 4D).
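The preprocessing steps can be sketched as follows. This is only an illustration under stated assumptions: the paper does not name the resampling method, so plain linear interpolation is used as a stand-in (without an anti-aliasing filter); the channel order in `CLASSES` and the placement of the 10-sample window around each QRS location are likewise assumed.

```python
import statistics

# Channel order is an assumption of this sketch.
CLASSES = ("none", "normal", "pac", "pvc")


def resample_linear(signal, fs_in, fs_out=100):
    """Resample by linear interpolation (stand-in for the unspecified method)."""
    n_out = int(len(signal) * fs_out / fs_in)
    out = []
    for k in range(n_out):
        t = k * fs_in / fs_out          # position in input samples
        i = int(t)
        frac = t - i
        nxt = signal[min(i + 1, len(signal) - 1)]
        out.append(signal[i] * (1 - frac) + nxt * frac)
    return out


def standardise(signal):
    """Scale one signal to zero mean and unit variance."""
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    if std == 0:
        return [0.0] * len(signal)
    return [(x - mean) / std for x in signal]


def build_target(n_samples, beats, width=10):
    """Build a four-channel mask; beats is a list of (sample_index, class_name)."""
    mask = [[0] * n_samples for _ in CLASSES]
    mask[0] = [1] * n_samples           # start with "no QRS" everywhere
    half = width // 2
    for idx, name in beats:
        channel = CLASSES.index(name)
        for i in range(max(0, idx - half), min(n_samples, idx - half + width)):
            for ch in mask:             # keep exactly one active channel
                ch[i] = 0
            mask[channel][i] = 1
    return mask
```

For example, `build_target(3000, [(1500, "pvc")])` marks samples 1495 to 1504 in the assumed PVC channel and leaves the "no QRS" channel active everywhere else.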
Model architecture and training. The developed model consists of five residual blocks (Fig. 5B,C), a gated recurrent layer (Fig. 5D), and a fully connected layer (Fig. 5E). The model outputs a four-channel tensor (Fig. 5F). Each residual block consists of several convolutional layers. We used a batch size of 64, a cross-entropy loss function, and an AdamW optimiser with a learning rate of 0.001 and no weight decay; the gradient L2 norm was clipped to 1.0.
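A rough PyTorch sketch of such an architecture and training step is given below. Only the overall layout (five residual blocks, a recurrent layer, a fully connected layer, four output channels) and the optimiser settings come from the text. The channel widths, kernel size, batch normalisation, two convolutions per block, the bidirectional GRU, and the names `ResBlock`, `QRSNet`, and `train_step` are assumptions, so this should not be read as the published model.

```python
import torch
from torch import nn


class ResBlock(nn.Module):
    """Residual block with two 1-D convolutions (layer count is assumed)."""

    def __init__(self, c_in, c_out, kernel=9):
        super().__init__()
        pad = kernel // 2               # keeps the time axis length unchanged
        self.body = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel, padding=pad),
            nn.BatchNorm1d(c_out),
            nn.ReLU(),
            nn.Conv1d(c_out, c_out, kernel, padding=pad),
            nn.BatchNorm1d(c_out),
        )
        self.skip = nn.Conv1d(c_in, c_out, 1) if c_in != c_out else nn.Identity()
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))


class QRSNet(nn.Module):
    """Five residual blocks, a GRU, and a per-sample linear classifier."""

    def __init__(self, channels=(16, 16, 32, 32, 64), hidden=32, n_classes=4):
        super().__init__()
        blocks, c_in = [], 1
        for c_out in channels:          # widths are hypothetical
            blocks.append(ResBlock(c_in, c_out))
            c_in = c_out
        self.blocks = nn.Sequential(*blocks)
        self.gru = nn.GRU(c_in, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                       # x: (batch, 1, samples)
        h = self.blocks(x).transpose(1, 2)      # (batch, samples, channels)
        h, _ = self.gru(h)
        return self.fc(h).transpose(1, 2)       # logits: (batch, 4, samples)


model = QRSNet()
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)
loss_fn = nn.CrossEntropyLoss()


def train_step(x, y):
    """One optimisation step; y holds per-sample class indices (batch, samples)."""
    model.train()
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # L2 norm clip
    optimiser.step()
    return loss.item()
```

A 30-s input at 100 Hz corresponds to `x` of shape `(64, 1, 3000)` for the stated batch size; a softmax over the channel axis of the output gives the per-sample class probabilities.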
Post-processing. The network outputs the likelihood (Fig. 4F) of the four classes (no QRS, normal QRS, atrial QRS, ventricular QRS) for every input sample. We take the class with the maximum likelihood for every sample and post-process the resulting segmentation mask to obtain a list of QRS peaks. First, we calculate the centres of the segments in the mask and save them into a list of potential peaks. Then, we remove lower peaks that are too close (< 0.15 s) to stronger peaks, as such a short distance between beats is physiologically improbable.
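The post-processing can be sketched in a few lines. This is an illustration rather than the authors' code: the text does not define how peak strength is measured, so the sketch assumes the maximum class likelihood within each segment, and the function name `extract_peaks` is hypothetical.

```python
def extract_peaks(probs, fs=100, min_distance_s=0.15):
    """Turn per-sample class likelihoods into a list of (sample, class) peaks.

    probs: one [p_none, p_normal, p_atrial, p_ventricular] list per sample.
    """
    # Class with the maximum likelihood for every sample.
    labels = [max(range(len(p)), key=p.__getitem__) for p in probs]

    # Centre of every contiguous run of one QRS class is a candidate peak.
    candidates = []                             # (centre, class, strength)
    start = None
    for i, lab in enumerate(labels + [0]):      # trailing 0 closes the last run
        if start is not None and lab != labels[start]:
            cls = labels[start]
            strength = max(probs[j][cls] for j in range(start, i))
            candidates.append(((start + i - 1) // 2, cls, strength))
            start = None
        if start is None and lab != 0:
            start = i

    # Keep the strongest peaks; drop weaker ones closer than min_distance_s.
    min_dist = min_distance_s * fs
    kept = []
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        if all(abs(cand[0] - k[0]) >= min_dist for k in kept):
            kept.append(cand)
    return sorted((centre, cls) for centre, cls, _ in kept)
```

Processing candidates from strongest to weakest ensures that a weak peak is only ever removed because of a stronger neighbour, never the other way round.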
Compared QRS detectors and used metrics. Our results for the compared detectors may differ from the performance reported in the respective papers since we used all available data from all datasets; we implemented the detectors by Elgendi 2 and Malik 3 following the respective papers. We used the F1 score to compare and evaluate results. A detected QRS complex was considered a true positive when its annotation mark was within 0.1 s (inclusive) of an annotation mark prepared by an expert. A false positive was counted when a detected QRS complex had no matching expert annotation, and a false negative when an expert-annotated beat had no matching detection.
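The matching rule can be made concrete with a short sketch. The one-to-one greedy pairing of sorted marks and the function name `detection_f1` are assumptions; the text only specifies the 0.1-s inclusive tolerance and the F1 score.

```python
def detection_f1(reference, detected, fs, tolerance_s=0.1):
    """Match detections to expert marks (sample indices) and compute F1.

    Each expert mark can be matched by at most one detection.
    """
    tol = tolerance_s * fs
    ref, det = sorted(reference), sorted(detected)
    i = j = tp = 0
    while i < len(ref) and j < len(det):
        diff = det[j] - ref[i]
        if abs(diff) <= tol:        # within tolerance (inclusive): true positive
            tp += 1
            i += 1
            j += 1
        elif diff < 0:              # detection with no expert mark nearby
            j += 1
        else:                       # expert mark with no detection nearby
            i += 1
    fp = len(det) - tp
    fn = len(ref) - tp
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0   # 0.0 for empty input is a convention
    return tp, fp, fn, f1
```

For expert marks at samples 100, 200, and 300 and detections at 105, 211, 300, and 400 (fs = 100 Hz), the detection at 211 lies 0.11 s from the nearest mark, so it counts as a false positive and the beat at 200 as a false negative.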

Results
The model was built using the PyTorch framework 31 and trained for 70 epochs using the private MDT dataset. We separately evaluated QRS detection performance and QRS classification performance; we also evaluated the computational performance of the method.

QRS detection performance.
We obtained F1 scores of 0.991, 0.990, and 0.992 on the MDT training, validation, and test sets, respectively, for the detection task. We also performed a cross-database test to evaluate detection performance on twelve public datasets, showing a mean F1 score of 0.96 ± 0.06. The detection performance was compared to other methods using all twelve test datasets. The proposed method reached the highest mean F1 score (0.961), followed by the Malik method 3 (0.955) and the XQRS detector from the WFDB 30 Python package. We also observed how difficult the individual databases were for the tested detectors. Overall F1 results of all detectors per database showed that the easiest database for QRS detection was the STDB 25

QRS classification performance.
We evaluated the ability of the proposed method to classify QRS complexes into three groups: normal beat, premature ventricular contraction, and premature atrial contraction (Table 4, the last row). We reached overall classification F1 scores of 0.96 (micro) and 0.74 (macro) on the MDT test set. Cross-database tests for QRS classification (Table 4, the first four rows) showed average micro and macro F1 scores of 0.95 ± 0.03 and 0.73 ± 0.06, respectively.
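The gap between micro and macro F1 scores is typical for imbalanced classes, which the following sketch illustrates. It assumes that true and predicted labels are already paired beat by beat; how detection errors enter the classification scores is not specified in the text, and the function name and toy labels are hypothetical.

```python
def micro_macro_f1(true_labels, pred_labels, classes=("N", "PAC", "PVC")):
    """Return (micro F1, macro F1) for paired true and predicted beat labels."""
    pairs = list(zip(true_labels, pred_labels))
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0
    macro = sum(per_class) / len(per_class)     # every class weighs the same
    return micro, macro


# Toy example: 96 normal beats, 2 PACs, 2 PVCs; one PAC is labelled normal.
true = ["N"] * 96 + ["PAC"] * 2 + ["PVC"] * 2
pred = ["N"] * 96 + ["PAC", "N"] + ["PVC"] * 2
micro, macro = micro_macro_f1(true, pred)
```

In this toy case, a single misclassified PAC leaves the micro F1 at 0.99 but lowers the macro F1 to about 0.89, because the rare PAC class (F1 of 2/3) counts as much as the normal class in the macro average.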

Method computational performance.
We measured the processing time of all compared methods using all testing datasets (excluding the CYBHi) to evaluate computational performance. The average processing time per record is shown in Fig. 7. The comparison was obtained using a computer with an Intel® Xeon® Gold 6248R CPU running at 3.00 GHz. Data were supplied to the algorithms one by one, and we disabled the GPU, which disadvantaged the proposed method (Fig. 7).

Discussion
The presented method showed the highest overall QRS detection F1 score among the compared methods (Table 3) when using all test datasets. We focused on Holter ECG data acquired during usual daily activities, and the presented method achieved the best score of the tested methods on the MDT dataset. The highest overall score might reflect the large amount of disrupted ECG data that we used for training. Table 3 (the row "MDT") demonstrates how well different methods can detect QRS complexes in noisy data. Figure 6 shows examples of non-trivial Holter ECG and the results of the presented and compared detection methods. Figure 6A demonstrates that four methods overlooked PVCs with abnormally low amplitude; Fig. 6B shows how methods react to signal disturbance and how most of them capture noise instead of a QRS complex if the two are very close (19th second). Finally, Fig. 6C demonstrates how unusual PVC couplets and noise may confuse detectors.
We also compared the presented method to the deep-learning method 11 trained on the CYBHi dataset 23 and tested on the MIT-BIH 12 dataset, where it reached an F1 score of 0.96. Our method slightly outperforms the compared method on MIT-BIH; on the other hand, we used a significantly more complex network structure.
An important benefit of the presented method is that it classifies QRS complexes into three groups. Our results show that the weakest point of QRS classification is the PAC class (Table 4). Further investigation revealed that, in most cases, false PACs are generated inside blocks of atrial fibrillation, where the presented method tends to report PACs. We also found incorrect classifications in long SVT runs (series of PACs at a high heart rate).
A limitation in comparison to most other methods is processing time, as shown in Fig. 7. However, this can be overcome when the model uses a GPU during inference. In such a case, inference time can be decreased approximately 10-30 times, depending on the specific hardware and batch size.

Conclusion
We presented a novel deep learning method for QRS detection and classification in one inference step. The method was evaluated on twelve public datasets not used for model development. This cross-database test showed higher overall QRS detection performance than other compared methods. Furthermore, we showed how compared QRS detectors behave in non-trivial situations common in Holter ECG. We also demonstrated that both QRS classification and detection could be combined into one deep-learning model. Therefore, the usual processing chain to analyse heart rhythm can be simplified.